INVARIANT P-VALUES FOR MODEL CHECKING

By Michael Evans and Gun Ho Jang 

University of Toronto 

P-values have been the focus of considerable criticism based on 
various considerations. Still, the P-value represents one of the most 
commonly used statistical tools. When assessing the suitability of a 
single hypothesized distribution, it is not clear that there is a better 
choice for a measure of surprise. This paper is concerned with the 
definition of appropriate model-based P-values for model checking. 

1. Introduction. The use of P-values is common in statistical practice.
Despite this, it is reasonable to say that the logical foundations for the
P-value are somewhat weak. This has led to a variety of criticisms of P-values
and even to doubts as to their correctness; see, for example, the discussions in
[3-5, 9, 12, 13] and [15]. For example, [9] is concerned with the use of the 5%
cut-off as a standard for determining whether or not a result is "significant,"
while [3] and [4] argue that frequentist P-values can be misleading.

While various arguments have been advanced for alternatives to P-values, 
there are situations where the use of P-values seems unavoidable. One such 
context arises with model checking, where we have observed data $x_0 \in \mathcal{X}$ and
want to assess whether or not $x_0$ is a plausible value from a fixed probability
measure P. For example, in model checking, P could arise as the conditional
distribution of the observed data given a minimal sufficient statistic or as
the distribution of an ancillary statistic such as a function of residuals. If
the P-value leads us to doubt that $x_0$ could have arisen from P, then we
also have reason to doubt the underlying statistical model.

So, the basic problem we consider is to determine a measure of how surprising
the observed value $x_0$ is as a possible value from P. A common approach
to this is to say that we need to prescribe a real-valued discrepancy
statistic $T : \mathcal{X} \to R^1$, so that, in some sense, $T(x)$ measures how divergent
the value x is, and then to compute the P-value

(1)    $P(T(x) \geq T(x_0))$.

If (1) is small, then we are led to doubt P. In general, no guidance is
provided as to how the statistic T is to be chosen with respect to P. For
example, we do not have a likelihood ratio available in this situation as a
possible choice of T. Further, it can be noted that some restrictions on T
are necessary if (1) is to have an appropriate interpretation. In particular,
the right tail of the distribution of T should be the only region that has
relatively low probability. Otherwise, we could have a value of $T(x_0)$ in the
left tail of the induced measure $P_T$, or near a shallow anti-mode of $P_T$, that
leads to a reasonable value of (1), and yet $T(x_0)$ could still be considered
as surprising.

In the case where T has a discrete probability distribution, there is a
natural definition of a P-value that avoids these problems, namely,

(2)    $P(p_T(T(x)) \leq p_T(T(x_0)))$,

where $p_T$ is the probability function of $P_T$. In this case, we see that (2) is
the probability of obtaining a value of T with probability of occurrence no
greater than the probability of what was actually observed. If this probability
is small, then $T(x_0)$ is a surprising value. We immediately see that (2)
will identify values of T in either tail, or near shallow anti-modes, as being
surprising. In fact, there is no need to require that T be real-valued for (2)
to make sense.

While (2) seems like a very natural definition, a serious problem arises
when we attempt to generalize this to the situation where the distribution
of T is absolutely continuous, say with density $f_T$ with respect to some
support measure on the range space $\mathcal{T}$ of T. Intuitively, we would like to
use the analog of (2) given by

(3)    $P(f_T(T(x)) \leq f_T(T(x_0)))$.

But now suppose that we have a 1-1, smooth transformation $W : \mathcal{T} \to \mathcal{T}$.
The density of W is $f_W(w) = f_T(W^{-1}(w)) J_W(W^{-1}(w))$, where $J_W(t)$ is the
reciprocal of the Jacobian determinant of W at t. The P-value based on the
density of W is

$P(f_W(W \circ T(x)) \leq f_W(W \circ T(x_0))) = P(f_T(T(x)) J_W(T(x)) \leq f_T(T(x_0)) J_W(T(x_0)))$

and it is clear that, unless $J_W(T(x))$ is constant, this
will not equal (3). In fact, these values can be quite different. We refer to
this as the noninvariance of the P-value given by (3).

It is the goal of this paper to provide a definition of a P-value for the
absolutely continuous case that is invariant, one which proceeds logically
from the natural definition in the discrete case as given by (2). We base our
argument on the intuitively reasonable idea that any absolutely continuous
model is, in fact, an approximation of an underlying model that is discrete,
as it is based on a measurement process for a response that has finite accuracy.
When we take this approximation into account and adjust for any
volume distortions induced by transformations such as discrepancy statistics
T, we are led to an invariant P-value. The central idea is that we do not
want inferences in statistical problems to be based on changes in volume as
induced by transformations and so we must adjust the density $f_T$ appropriately
in (3) to avoid this. We provide the basic definition and argument in
Section 2, along with examples that support our approach. In Section 3, we 
apply these results to model checking problems. In Section 4, we draw some 
conclusions. 
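
The noninvariance of (3) described in this section can be checked numerically. The following is a minimal sketch under assumptions that are not part of the paper: P is taken to be N(0, 1), T(x) = x, the transformation is W(t) = exp(t), the observed value is x0 = 2, and all function names are hypothetical. It evaluates (3) once from the density of T and once from the density of W, both in closed form and by simulation.

```python
import math
import random


def std_normal_cdf(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def log_f_T(t):
    """Log density of T(x) = x when x ~ N(0, 1) (illustrative choice of P)."""
    return -0.5 * t * t - 0.5 * math.log(2.0 * math.pi)


def log_f_W(w):
    """Log density of W = exp(T): f_W(w) = f_T(log w) * (1 / w)."""
    return log_f_T(math.log(w)) - math.log(w)


def density_pvalue_mc(log_density, transform, x0, n_draws=200_000, seed=0):
    """Monte Carlo estimate of P(density(transform(x)) <= density(transform(x0)))."""
    rng = random.Random(seed)
    ref = log_density(transform(x0))
    hits = sum(
        log_density(transform(rng.gauss(0.0, 1.0))) <= ref for _ in range(n_draws)
    )
    return hits / n_draws


x0 = 2.0  # assumed observed value

# (3) from the density of T: the event is {|x| >= |x0|}.
p_T_exact = 2.0 * (1.0 - std_normal_cdf(abs(x0)))
# (3) from the density of W = exp(T): the event is {|x + 1| >= |x0 + 1|}.
c = abs(x0 + 1.0)
p_W_exact = std_normal_cdf(-1.0 - c) + 1.0 - std_normal_cdf(-1.0 + c)

p_T_mc = density_pvalue_mc(log_f_T, lambda x: x, x0)
p_W_mc = density_pvalue_mc(log_f_W, math.exp, x0)

print(f"P-value from density of T: exact {p_T_exact:.4f}, simulated {p_T_mc:.4f}")
print(f"P-value from density of W: exact {p_W_exact:.4f}, simulated {p_W_mc:.4f}")
```

Under these assumptions the two values differ by roughly a factor of two, even though W is a 1-1 smooth transformation of T.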

2. A general P-value. To start, we consider a response x taking values in
an open subset $\mathcal{X}$ of $R^k$ and develop an appropriate P-value when $T(x) = x$.
We suppose that the probability measure P has density f with respect to
support measure $\mu$ on $\mathcal{X}$.

In measure-theoretic terms, a density f, with respect to a support measure
$\mu$, is seen simply as a device to compute probabilities. In statistical contexts,
however, a density plays a somewhat more significant role. For example, if
$f(x_1) > f(x_2)$, then we want to say that the probability of $x_1$ occurring is
greater than the probability of $x_2$ occurring. For this to hold, we cannot
allow f to be defined in an arbitrary fashion. In effect, we need to have
that $P(A)/\mu(A) \to f(x)$ as the set A converges to $\{x\}$ since this implies
that $P(A) \approx f(x)\mu(A)$ when A is close to $\{x\}$. Further, to compare the
probabilities of two points $x_1$ and $x_2$, we need $A_i \to \{x_i\}$ with
$\mu(A_1) = \mu(A_2)$ and then, for example, we can say that the probability of $x_1$ occurring
is greater than the probability of $x_2$ occurring when $f(x_1) > f(x_2)$. The
mathematics of making this precise is discussed in, for example, [14], under
the topic of differentiating one measure with respect to another.

It seems natural to choose $\mu = \mu_k$, where $\mu_k$ is Euclidean volume on $\mathcal{X}$, as
it weights sample points equally and so $f(x)$ expresses the essence of how the
probability measure is behaving at x. This is analogous to using counting
measure as the support measure in the discrete case since $f(x)$ then has a
direct interpretation as the probability of x.

We now consider the essential discreteness of the observational process.
Suppose that this translates into a value x lying in a set $B_n(x)$ such that
$\{B_n(x) : x \in \mathcal{X}\}$ forms a partition of $\mathcal{X}$ with $\mu_k(B_n(x))$ finite and constant
in x and such that $B_n(x)$ shrinks "nicely" (see [14]) to x as $n \to \infty$. We then
have that $P(B_n(x))/\mu_k(B_n(x)) \to f(x)$ as $n \to \infty$ as long as f is continuous
at x. So, for n large, $P(B_n(x)) \approx f(x)\mu_k(B_n(x))$ and $f(x)$ serves as surrogate
for the probability of x, at least when we are comparing the probabilities of
different values of x occurring. Note that the constancy of $\mu_k(B_n(x))$ in x
is necessary for this interpretation of $f(x)$. As a particular example of this,
suppose that $\mathcal{X} = R^1$, that we partition $R^1$ using $\{((i-1)/n, i/n] : i \in Z\}$ and
that $B_n(x)$ is the set $((i-1)/n, i/n]$ that contains x.

Rather than observing x, the essential discreteness of the problem means
that we will observe some $x_n(x) \in B_n(x)$ and the probability of observing
$x_n(x)$ is $P(B_n(x))$. Note that, implicitly, $x_0$ is one of the values assumed by
$x_n$. Then, for the discrete response variable $x_n$, the appropriate P-value (2)
is given by

(4)    $P(\{x_n(x) : P(B_n(x)) \leq P(B_n(x_0))\})$.

We then want to show that (3), with $f_T = f$, approximates (4).

While such a result seems intuitively plausible, a general proof is not
straightforward. We require some regularity conditions as we cannot expect
such an approximation to hold if we allow f and the partition
$\{B_n(x) : x \in \mathcal{X}\}$ to be too general. For this, we use the theory of contented sets and
functions, as discussed in [10], where it is used to develop the Riemann integral.
Essentially, a bounded set A is contented if its $\mu_k$-measure can be
approximated arbitrarily closely by the $\mu_k$-measure of a finite union of disjoint
rectangles contained in A and also by the $\mu_k$-measure of a finite union
of disjoint rectangles containing A. A bounded function f with compact
support is contented if it can be approximated arbitrarily closely by step
functions based on contented sets. Further, we say that a function f is locally
constant at x if we can find an open set containing x on which f is
constant. For $x_0 \in \mathcal{X}$, let $LC(x_0) = \{x : f(x) = f(x_0), f$ is locally constant at
$x\}$. In [6], we provide the proof of the following result.

Theorem 1. Suppose that:

(i) $\mathcal{X}$ is a contented subset of $R^k$ with positive content;

(ii) $B_n(x)$ is a rectangle containing x with $\mu_k(B_n(x))$ finite and constant
in x and such that $B_n(x)$ shrinks nicely to x as $n \to \infty$;

(iii) $\{B_n(x) : x \in R^k\}$ forms a partition of $R^k$ with $\{B_{n+1}(x) : x \in R^k\}$ a
subpartition of $\{B_n(x) : x \in R^k\}$ and $\sup_{x \in R^k} \mathrm{diam}(B_n(x)) \to 0$ as $n \to \infty$;

(iv) f is a continuous density function on $\mathcal{X}$ with $f^{-1}A$ contented for any
interval A and such that $LC(x_0)$ is contented with $\mu_k(f^{-1}f(x_0) \cap LC(x_0)^c) = 0$.

Then (4) converges to $P(f(x) \leq f(x_0))$ as $n \to \infty$.

Theorem 1 gives conditions under which the appropriate discrete P-value,
in the sense that we will always be measuring x to some finite accuracy, is
indeed approximated by the continuous version given by $P(f(x) \leq f(x_0))$.
Although the result will hold under weaker conditions, for example, we could
allow f to be piecewise smooth and allow for more general sets than rectangles
in $\{B_n(x) : x \in R^k\}$, the conditions specified seem to apply in typical
applications. For a specific example where $P(f(x) \leq f(x_0))$ fails to provide
an approximation to (4) and where f is a continuous density, see [6]. Basically,
the approximation can fail when f is too oscillatory so that (iv) does
not hold. The example indicates, however, that this is more of a mathematical
pathology than something we would encounter in a typical application.
Condition (iv) can be substantially weakened if $P(f(x) = c) = 0$ for every
c. In general, the distribution of $f(x)$ can have a discrete component, but
our conditions imply that this can only arise by f being constant on sets
where it is locally constant. Also, Theorem 1 requires that the accuracy of
the discretization is effectively the same across the sample space. In certain
situations, we may want to allow this accuracy to vary across $\mathcal{X}$ so that we
could obtain a different approximation to (4), but we do not pursue this
issue further here.
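
The convergence asserted by Theorem 1 can be illustrated with a small computation. The sketch below rests on assumptions that are not in the paper: the density is taken to be f(x) = 6x(1 - x) on (0, 1), the partition cells are ((i - 1)/n, i/n], the observed value is x0 = 0.1234, and the function names are hypothetical. It computes the discrete P-value (4) exactly for several n and compares it with P(f(x) <= f(x0)).

```python
import math


def cdf(x):
    """Distribution function of the assumed density f(x) = 6x(1 - x) on (0, 1)."""
    return 3.0 * x * x - 2.0 * x ** 3


def continuous_pvalue(x0):
    """P(f(x) <= f(x0)); f is symmetric about 1/2 and increasing on (0, 1/2)."""
    a = min(x0, 1.0 - x0)
    return 2.0 * cdf(a)


def discrete_pvalue(x0, n, tol=1e-12):
    """P-value (4) for the partition {((i - 1)/n, i/n] : i = 1, ..., n} of (0, 1]."""
    probs = [cdf(i / n) - cdf((i - 1) / n) for i in range(1, n + 1)]
    i0 = math.ceil(x0 * n)  # index of the cell containing x0
    ref = probs[i0 - 1]
    # tol guards against floating-point ties between symmetric cells
    return sum(p for p in probs if p <= ref + tol)


x0 = 0.1234  # assumed observed value
p_limit = continuous_pvalue(x0)
for n in (10, 100, 1000):
    print(n, round(discrete_pvalue(x0, n), 6), round(p_limit, 6))
```

Each partition in the sequence n = 10, 100, 1000 refines the previous one, in line with condition (iii), and the gap between (4) and its continuous counterpart shrinks as n grows.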

We now consider basing the P-value on a general discrepancy statistic T.
The question then is: given the discretization on $\mathcal{X}$ as determined by the
measurement process, how should we take this into account? For, even if T
is 1-1, it will give rise to volume distortions and we do not want these to
affect our P-value.

Suppose, first, that $\mathcal{X}$ and $\mathcal{T}$ are open subsets of $R^k$ and that T is 1-1
and smooth. A partition element $B_n(x) \subset \mathcal{X}$ with measure $\mu_k(B_n(x))$ is then
transformed into $TB_n(x)$ with measure $\mu_k(TB_n(x)) = \mu_k(B_n(x)) J_T^{-1}(x')$ for
some $x' \in B_n(x)$, while the density of the transformed response with respect
to $\mu_k$ is $f_T(t) = f(T^{-1}(t)) J_T(T^{-1}(t))$. Accordingly, we cannot use
the P-value $P_T(f_T(t) \leq f_T(t_0))$ to assess whether or not $t_0$ or, equivalently,
$x_0 = T^{-1}(t_0)$ is surprising since the density $f_T(t)$ depends on volume distortions
and the sets $TB_n(x)$ are no longer necessarily of equal volume.
There is clearly an easy fix for this, however, as we simply correct for this
volume distortion and then

$P_T(f_T(t)/J_T(T^{-1}(t)) \leq f_T(t_0)/J_T(T^{-1}(t_0))) = P(f(x) \leq f(x_0))$.

With this refinement, (3) becomes invariant under 1-1,
smooth transformations of the response x, that is, we retain as part of the
problem prescription how the continuous probability model is approximating
an essentially discrete response.
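
The effect of dividing by the volume distortion in the 1-1 case can be sketched as follows. The setting is an assumption made only for illustration: x is N(0, 1), T(x) = exp(x), so that J_T(T^{-1}(t)) = 1/t, and x0 = 2; the function names are hypothetical. The code estimates the uncorrected P-value based on f_T and the corrected one based on f_T(t)/J_T(T^{-1}(t)), and compares both with P(f(x) <= f(x0)).

```python
import math
import random


def log_phi(x):
    """Log density of N(0, 1), the assumed density f of the response."""
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)


def log_f_T(t):
    """Log of f_T(t) = f(T^{-1}(t)) J_T(T^{-1}(t)) for T(x) = exp(x)."""
    return log_phi(math.log(t)) - math.log(t)


def log_J_T_at_preimage(t):
    """Log of J_T(T^{-1}(t)); here J_T(x) = 1/exp(x), so this equals -log t."""
    return -math.log(t)


def pvalues(x0, n_draws=200_000, seed=0):
    """Return (uncorrected, corrected) Monte Carlo P-values on the t scale."""
    rng = random.Random(seed)
    t0 = math.exp(x0)
    ref_raw = log_f_T(t0)
    ref_adj = log_f_T(t0) - log_J_T_at_preimage(t0)
    raw = adj = 0
    for _ in range(n_draws):
        t = math.exp(rng.gauss(0.0, 1.0))
        raw += log_f_T(t) <= ref_raw
        adj += log_f_T(t) - log_J_T_at_preimage(t) <= ref_adj
    return raw / n_draws, adj / n_draws


x0 = 2.0  # assumed observed value
p_raw, p_adj = pvalues(x0)
# P(f(x) <= f(x0)) = 2(1 - Phi(|x0|)), written with the complementary error function
p_original = math.erfc(abs(x0) / math.sqrt(2.0))
print(p_raw, p_adj, p_original)
```

In this sketch the corrected value agrees with the P-value computed on the original scale up to simulation error, while the uncorrected one does not.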

In general, however, T will not be 1-1. Suppose, then, that $\mathcal{X}$ is an open
subset of $R^k$ and $\mathcal{T}$ is an open subset of $R^l$, where $l \leq k$. Let $f_T$ denote
the density of $P_T$ with respect to $\mu_l$ and suppose that this is continuous.
Suppose that T is sufficiently smooth so that for each $t \in \mathcal{T}$, the set $T^{-1}\{t\}$
is a smooth manifold with volume measure on $T^{-1}\{t\}$ denoted by $\nu_t$. As a
simple example of this, suppose that $T(x) = x_1^2 + \cdots + x_k^2$ so that $T^{-1}\{t\}$ is
a sphere of radius $t^{1/2}$, centered at the origin and $\nu_t$ is surface area measure.
If T is 1-1, then $T^{-1}\{t\}$ is a 0-dimensional manifold and $\nu_t$ is counting
measure.
Results in [16] show that, in general,

(5)    $f_T(t) = \int_{T^{-1}\{t\}} f(x)\,|\det(dT(x) \circ dT'(x))|^{-1/2}\,\nu_t(dx)$,

where dT is the differential of T. Formula (5) directly shows how $f_T$ is affected
by volume distortions. For, at $x \in T^{-1}\{t\}$, the contribution to the density
value $f_T(t)$ is distorted by the factor $J_T(x) = |\det(dT(x) \circ dT'(x))|^{-1/2}$.
Accordingly, just as we do in the 1-1 case, we adjust the integrand in (5)
by dividing by the factor $J_T(x)$ to obtain

$f_T^*(t) = \int_{T^{-1}\{t\}} f(x)\,\nu_t(dx)$

as the appropriate density to use. A simple example of this occurs when T
is k-to-one, so $T^{-1}\{t\} = \{x_1(t), \ldots, x_k(t)\}$ for each t. Then $\nu_t$ is counting
measure and, by (5), $f_T(t) = \sum_{i=1}^k f(x_i(t)) J_T(x_i(t))$. In this case, we have
that the corrected density is $f_T^*(t) = \sum_{i=1}^k f(x_i(t))$ and this equals $f(T^{-1}(t))$
when k = 1. In general, we see that $f_T^*$ is the density of $P_T$ with respect to the
measure $(f_T(t)/f_T^*(t))\,\mu_l(dt)$ and the ratio $f_T(t)/f_T^*(t)$ measures the effect
of the volume distortion induced by T on the density $f_T$.
We then compute the P-value

(6)    $P_T(f_T^*(t) \leq f_T^*(t_0)) = P(f_T^*(T(x)) \leq f_T^*(T(x_0)))$

to assess whether or not $t_0 = T(x_0)$ is a surprising value from $P_T$. This P-value
depends only on the density assignment f on the original response
space, which is determined by how we are approximating an essentially discrete
response, and the preimage sets of T.

We have the following simple, but significant, result for (6). 

Theorem 2. If $\mathcal{X}$ is an open subset of $R^k$, $T : \mathcal{X} \to \mathcal{T}$ is onto with
$\mathcal{T} \subset R^l$ open and T is sufficiently smooth, then the P-value given by (6)
is invariant under 1-1 smooth transformations on $\mathcal{T}$.

Proof. Suppose that W is a 1-1, smooth transformation defined on $\mathcal{T}$
and that $w = W(t)$. Then $(W \circ T)^{-1}\{w\} = T^{-1}\{t\}$ and
$f^*_{W \circ T}(w) = \int_{T^{-1}\{t\}} f(x)\,\nu_t(dx) = f_T^*(t)$. □

We now consider some examples and note that these support (6) as the 
appropriate definition of an invariant P-value. 

Example 1 ($P_T$ is discrete). First, suppose that P is discrete on countable
$\mathcal{X}$. Then $\nu_t$ is counting measure on $T^{-1}\{t\}$, $J_T(x) = 1$ and, therefore,
$f_T^*(t) = \int_{T^{-1}\{t\}} f(x)\,\nu_t(dx) = \sum_{x \in T^{-1}\{t\}} P(X = x) = p_T(t)$ is the probability
function of T. Hence, (6) equals (2). Note that dT is just the identity,
so there is no volume distortion. If P is continuous, then, for those
t with $p_T(t) > 0$, we have that $\nu_t$ is $\mu_k$ restricted to $T^{-1}\{t\}$. Accordingly,
$f_T^*(t) = \int_{T^{-1}\{t\}} f(x)\,\nu_t(dx) = p_T(t)$ and, again, (6) equals (2).

Example 2 ($J_T(x)$ is constant for $x \in T^{-1}\{t\}$). Note that $J_T(x)$ is constant
for all x whenever $T(x)$ is an affine transformation. Hence, we could
have $T(x) = a + Bx$ for some $a \in R^l$ and $B \in R^{l \times k}$ when $\mathcal{X} \subset R^k$. Also,
note that if $x \in R^n$ and $T(x)$ is the order statistic, then $J_T(x)$ is constant
for all x. It is then clear that (6) equals $P_T(f_T(t) \leq f_T(t_0))$. For example,
when $T(x) = x$, we simply use the density of x to compute the P-value so
that when P is the N(0, 1) distribution, (6) is $2(1 - \Phi(|x_0|))$. As another example,
suppose that T is projection on the ith coordinate, so $J_T(x) = 1$.
Then $T^{-1}\{t\}$ is the set of points in $\mathcal{X}$ with ith coordinate equal to t, $\nu_t$ is
Euclidean volume on this set and $f_T^*(t)$ is the marginal density of the ith
coordinate. This generalizes to arbitrary coordinate projections.

The volume distortion can be constant in $T^{-1}\{t\}$ but vary with t. Putting
$J_T^*(t) = J_T(x)$ for $x \in T^{-1}\{t\}$, from (5), we have that $f_T(t) = f_T^*(t) J_T^*(t)$, so
(6) can be computed as $P_T(f_T(t)/J_T^*(t) \leq f_T(t_0)/J_T^*(t_0))$. This permits us
to avoid the integration involved in calculating $f_T^*(t)$ when we know the
distribution of T and can compute $J_T(x)$ easily.

For example, suppose that $T(x) = x'x$. Then $T^{-1}\{t\}$ is a sphere of dimension
$(k - 1)$ in $R^k$. Now, $dT(x) = 2(x_1 \cdots x_k)$, so $dT(x) \circ dT'(x) = 4x'x = 4t$
and $J_T(x) = 1/(2t^{1/2})$ is constant for $x \in T^{-1}\{t\}$ for every t. Note that
the adjustment factor involves multiplying $f_T(t)$ by $2t^{1/2}$ and this is precisely
the distortion caused by the "quadratic" part of the transformation.
The appropriate P-value is $P_T(f_T(t)t^{1/2} \leq f_T(t_0)t_0^{1/2})$. We see that
in this case, we must modify the usual density that we work with. As a
particular case, suppose that $x \sim N_k(0, I)$. Then $T(x) \sim \chi^2(k)$ with density
$f_T(t) = \Gamma^{-1}(k/2)\,2^{-k/2}\,t^{(k/2)-1}e^{-t/2}$. Therefore, the invariant P-value is given
by $P_T(t^{(k-1)/2}e^{-t/2} \leq t_0^{(k-1)/2}e^{-t_0/2})$ and only when k = 1 is this equivalent
to $P_T(t \geq t_0)$. In contrast, when we directly observe $T \sim \chi^2(k)$, in the sense
that it is a measured variable, and we discretize using equal length intervals,
the relevant P-value is $P_T(t^{(k/2)-1}e^{-t/2} \leq t_0^{(k/2)-1}e^{-t_0/2})$. As was just
shown, when we take into account that T arises as a transformation of a
measured variable, the P-value changes. Further, both of these P-values are
two-sided when k > 1.
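
The three P-values discussed for T(x) = x'x can be compared by simulation. The following is a rough sketch, with the sample size of the simulation, the seed, the values k = 3 and t0 = 9, and the function names all being assumptions introduced here. It draws x from N_k(0, I), forms t = x'x and estimates the right-tail value P(t >= t0), the invariant P-value based on t^{(k-1)/2} e^{-t/2}, and the value based on t^{(k/2)-1} e^{-t/2} that applies when T itself is the measured variable.

```python
import math
import random


def chi2_pvalues(k, t0, n_draws=100_000, seed=1):
    """Monte Carlo estimates of (right-tail, invariant, directly-measured) P-values."""
    rng = random.Random(seed)

    def log_kernel(t, power):
        # log of t**power * exp(-t/2); normalizing constants cancel in the comparison
        return power * math.log(t) - 0.5 * t

    ref_invariant = log_kernel(t0, (k - 1) / 2)
    ref_measured = log_kernel(t0, k / 2 - 1)
    tail = invariant = measured = 0
    for _ in range(n_draws):
        t = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))  # t = x'x
        tail += t >= t0
        invariant += log_kernel(t, (k - 1) / 2) <= ref_invariant
        measured += log_kernel(t, k / 2 - 1) <= ref_measured
    return tail / n_draws, invariant / n_draws, measured / n_draws


tail3, invariant3, measured3 = chi2_pvalues(k=3, t0=9.0)
tail1, invariant1, measured1 = chi2_pvalues(k=1, t0=4.0)
print("k = 3:", tail3, invariant3, measured3)
print("k = 1:", tail1, invariant1, measured1)
```

For k = 1 the invariant value coincides with the right-tail value, as stated above, whereas for k = 3 the invariant value also counts small values of t and so exceeds the right-tail probability.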

As we will see, Example 3 is a situation where $J_T(x)$ varies with $x \in T^{-1}\{t\}$.

In Section 3, we discuss a situation that involves comparing the observed
$x_0$ with the conditional distribution of x given that $W(x) = W(x_0) = w_0$ for
some smooth transformation W. In this case, the conditional density of x,
with respect to volume measure on $W^{-1}\{w_0\}$, is $f(x)J_W(x)/f_W(w_0)$ and it
is clear that the volume distortion at x, induced by the conditioning, is given
by $J_W(x)$. Accordingly, the relevant P-value, based on the full data, is given
by $P(f(x)/f_W(w_0) \leq f(x_0)/f_W(w_0) \mid W(x) = w_0) = P(f(x) \leq f(x_0) \mid W(x) = w_0)$.
If we have a transformation T of x, then the relevant P-value is as
follows.

Lemma 3. Suppose that $\mathcal{X}$, $\mathcal{W}$ and $\mathcal{T}$ are manifolds with volume measures
$\mu_\mathcal{X}$, $\mu_\mathcal{W}$ and $\mu_\mathcal{T}$, respectively, and $W : \mathcal{X} \to \mathcal{W}$, $T : \mathcal{X} \to \mathcal{T}$ are onto and
smooth. Let $\nu_{T,W,t,w}$ denote volume measure on $T^{-1}\{t\} \cap W^{-1}\{w\}$. The relevant
conditional P-value based on T, given $W(x) = w_0$, is then

(7)    $P_T(f^*_{T,W}(t \mid w_0) \leq f^*_{T,W}(t_0 \mid w_0) \mid W(x) = w_0)$,

where $t_0 = T(x_0)$ and $f^*_{T,W}(t \mid w_0) = \int_{T^{-1}\{t\} \cap W^{-1}\{w_0\}} f(x)\,\nu_{T,W,t,w_0}(dx)$.

Proof. The conditional density of T given $W = w$ is given by
$f_{T,W}(t \mid w) = \int_{T^{-1}\{t\} \cap W^{-1}\{w\}} (f(x)/f_W(w))\,J_{(T,W)}(x)\,\nu_{T,W,t,w}(dx)$.
So, the volume distortion induced by the transformations is $J_{(T,W)}(x)$ and the result follows. □

We will also need a technical result concerning the composition of map- 
pings. 

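Lemma 4. Suppose that $\mathcal{X}$, $\mathcal{U}$ and $\mathcal{T}$ are manifolds with volume measures
$\mu_\mathcal{X}$, $\mu_\mathcal{U}$ and $\mu_\mathcal{T}$, respectively, and $U : \mathcal{X} \to \mathcal{U}$, $T : \mathcal{U} \to \mathcal{T}$ are onto, smooth
mappings. Then

$f^*_{T \circ U}(t) = \int_{T^{-1}\{t\}} J_T(u) \int_{U^{-1}\{u\}} f(x)\,J^{-1}_{T \circ U}(x)\,J_U(x)\,\nu_{U,u}(dx)\,\nu_{T,t}(du)$,

where $\nu_{U,u}$ and $\nu_{T,t}$ are the volume measures on $U^{-1}\{u\}$ and $T^{-1}\{t\}$, respectively.

Proof. Suppose that $g : \mathcal{X} \to R^1$ is nonnegative, $\int_A g(x)\,\mu_\mathcal{X}(dx)$ is finite
for compact A and $B \subset \mathcal{T}$ is open. By the measure decomposition theorem
(see [16], Theorem 15.1) applied to $g(x)\,\mu_\mathcal{X}(dx)$ and $T \circ U$, we have that

$\int_\mathcal{X} I_B(T(U(x)))\,g(x)\,\mu_\mathcal{X}(dx) = \int_B \int_{(T \circ U)^{-1}\{t\}} g(x)\,J_{T \circ U}(x)\,\nu_{T \circ U,t}(dx)\,\mu_\mathcal{T}(dt)$.

Apply the measure decomposition theorem to $I_B(T(U(x)))\,g(x)\,\mu_\mathcal{X}(dx)$ and U,
and then to $\int_{U^{-1}\{u\}} I_B(T(U(x)))\,g(x)\,J_U(x)\,\nu_{U,u}(dx)\,\mu_\mathcal{U}(du)$ and T, to obtain

$\int_\mathcal{X} I_B(T(U(x)))\,g(x)\,\mu_\mathcal{X}(dx) = \int_\mathcal{U} \int_{U^{-1}\{u\}} I_B(T(U(x)))\,g(x)\,J_U(x)\,\nu_{U,u}(dx)\,\mu_\mathcal{U}(du)$
$= \int_B \int_{T^{-1}\{t\}} \int_{U^{-1}\{u\}} g(x)\,J_U(x)\,\nu_{U,u}(dx)\,J_T(u)\,\nu_{T,t}(du)\,\mu_\mathcal{T}(dt)$.

Then

$\int_{(T \circ U)^{-1}\{t\}} g(x)\,J_{T \circ U}(x)\,\nu_{T \circ U,t}(dx)\,\mu_\mathcal{T}(dt) = \int_{T^{-1}\{t\}} \int_{U^{-1}\{u\}} g(x)\,J_U(x)\,\nu_{U,u}(dx)\,J_T(u)\,\nu_{T,t}(du)\,\mu_\mathcal{T}(dt)$,

and setting $g(x) = f(x)\,J^{-1}_{T \circ U}(x)$ establishes the
result. □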



3. Model checking. Suppose that we have a statistical model $\{P_\theta : \theta \in \Theta\}$,
where $P_\theta$ is a probability measure on $\mathcal{X}$ with density $f_\theta$ with respect to
support measure $\mu_\mathcal{X}$. We now investigate P-values for checking the model in
light of the observed $x_0$.

If $W : \mathcal{X} \to \mathcal{W}$ is a minimal sufficient statistic, then the conditional distribution
of the data given W is independent of $\theta$ and is denoted $P(\cdot \mid W(x) = w_0)$.
To check the model, we can then compare $x_0$ to $P(\cdot \mid W(x) = w_0)$ to see
if the observed data is surprising. By the converse of the factorization theorem,
we have that $f_\theta(x) = g_\theta(W(x))h(x)$. Lemma 3 and (7) give an invariant
P-value that assesses $f_\theta$ for each $\theta$. We have the following result.

Theorem 5. For the statistical model $\{P_\theta : \theta \in \Theta\}$ with minimal sufficient
statistic W, the P-value (7), associated with discrepancy statistic T,
equals

(8)    $P_T(h^*_{T,W}(t \mid w_0) \leq h^*_{T,W}(t_0 \mid w_0) \mid W(x) = w_0)$,

where $h^*_{T,W}(t \mid w_0) = \int_{T^{-1}\{t\} \cap W^{-1}\{w_0\}} h(x)\,\nu_t(dx)$. That is, it is independent
of $\theta$ and (8) is independent of the choice of h.

Proof. In the continuous case, we assume that each density is continuous
at any observed $x_0$ and restrict our attention to those $x_0$ for which
$f_\theta(x_0) > 0$. If $f_\theta(x_0) > 0$, then $g_\theta(W(x_0)) > 0$ and $g_\theta(W(x)) = g_\theta(W(x_0))$
for the event $W(x) = w_0 = W(x_0)$. We have that (7) equals

$P_T\left( \int_{T^{-1}\{t\} \cap W^{-1}\{w_0\}} f_\theta(x)\,\nu_t(dx) \leq \int_{T^{-1}\{t_0\} \cap W^{-1}\{w_0\}} f_\theta(x)\,\nu_{t_0}(dx) \,\Big|\, W(x) = w_0 \right)$
$= P_T(h^*_{T,W}(t \mid w_0) \leq h^*_{T,W}(t_0 \mid w_0) \mid W(x) = w_0)$.

Further, if $g_\theta(W(x))h(x) = g'_\theta(W(x))h'(x)$, then

$P_T(h^*_{T,W}(t \mid w_0) \leq h^*_{T,W}(t_0 \mid w_0) \mid W(x) = w_0)$
$= P_T\left( \int_{T^{-1}\{t\} \cap W^{-1}\{w_0\}} g_\theta(W(x))h(x)\,\nu_t(dx) \leq \int_{T^{-1}\{t_0\} \cap W^{-1}\{w_0\}} g_\theta(W(x))h(x)\,\nu_{t_0}(dx) \,\Big|\, W(x) = w_0 \right)$
$= P_T\left( \int_{T^{-1}\{t\} \cap W^{-1}\{w_0\}} g'_\theta(W(x))h'(x)\,\nu_t(dx) \leq \int_{T^{-1}\{t_0\} \cap W^{-1}\{w_0\}} g'_\theta(W(x))h'(x)\,\nu_{t_0}(dx) \,\Big|\, W(x) = w_0 \right)$
$= P_T(h'^*_{T,W}(t \mid w_0) \leq h'^*_{T,W}(t_0 \mid w_0) \mid W(x) = w_0)$

and we are done. □

We now consider an application of this result. 

Example 3 (Checking a normal model using the Jarque-Bera test statistic).
Suppose that $x = (x_1, \ldots, x_n)$ is a sample of n from the $N(\mu, \sigma^2)$
distribution with $\mu \in R^1$ and $\sigma^2 > 0$ unknown. Then $W(x) = (\bar{x}, r)$, where
$r = \|x - \bar{x}1_n\|$, is minimal sufficient. Putting $d(x) = (x - \bar{x}1_n)/r$, we can
write $x = \bar{x}1_n + rd$ and note that $\bar{x}$, r and d are statistically independent
with d uniformly distributed on $S^{n-1} \cap L^\perp\{1_n\}$. In this case, h is constant
(so we can take it to be 1) and $W^{-1}\{(\bar{x}_0, r_0)\}$ is the $(n - 2)$-dimensional
sphere $\bar{x}_0 1_n + r_0(S^{n-1} \cap L^\perp\{1_n\})$.

It is natural here to consider functions of d as discrepancy statistics for
checking the model. If T is a real-valued function of d, then $h^*_{T,W}(t \mid w_0)$
is the volume of the $(n - 3)$-dimensional submanifold of $\bar{x}_0 1_n + r_0(S^{n-1} \cap L^\perp\{1_n\})$
given by $(T \circ d)^{-1}\{t\} \cap W^{-1}\{(\bar{x}_0, r_0)\}$. Alternatively, from the proof
of Theorem 5, we can compute the invariant P-value by assuming $(\mu, \sigma) = (0, 1)$,
letting f denote the density of a sample of n from the N(0, 1) distribution
and computing

$P_{(0,1)}(f^*_{T \circ d}(T(d(x))) \leq f^*_{T \circ d}(T(d(x_0))) \mid (\bar{x}_0, r_0)) = P_d(f^*_{T \circ d}(T(d)) \leq f^*_{T \circ d}(T(d_0)))$,

where $f^*_{T \circ d}(t) = \int_{(T \circ d)^{-1}\{t\}} f(x)\,\nu_t(dx)$ and d
is uniformly distributed on $S^{n-1} \cap L^\perp\{1_n\}$.

As an illustration, we consider a commonly used discrepancy statistic for
checking normality, namely, the Jarque-Bera statistic
$T = n(nT_3^2/6 + (nT_4 - 3)^2/24)$, where $T_p \circ d = \sum_{i=1}^n d_i^p$. The statistic T is an attempt to create an
omnibus test by combining the skewness and kurtosis statistics. The form
is based on the asymptotic normality of $(T_3, T_4)$ as this implies that T is
asymptotically $\chi^2(2)$.

The volume distortion induced by $T_p(d)$ can be computed explicitly as
$J_{T_p \circ d}(x) = (r/p)\,(T_{2p-2}(d(x)) - T_{p-1}^2(d(x))/n - T_p^2(d(x)))^{-1/2}$. From this, we
obtain that $J_{T \circ d}(x)$ for the Jarque-Bera statistic T satisfies, with each $T_p$ evaluated at $d(x)$,

$J_{T \circ d}(x)^{-2} = (n^4/r^2)\,[(nT_4/3 - 1)^2 T_6 + 2(nT_4/3 - 1)T_3 T_5 - (T_3^2 + nT_4^2/3 - T_4)^2 + T_3^2 T_4 - nT_3^2 T_4^2/9]$.

We see that $J_{T \circ d}(x)$ is not a function of $T \circ d$ and so the volume
distortion is not constant within $T^{-1}\{t\}$.
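
The Jarque-Bera statistic written in terms of d, and the closed-form expression for the squared gradient norm of T o d (which equals the inverse square of the volume distortion for a real-valued statistic), can be checked numerically. The sketch below is illustrative only: the data vector, the step size of the finite-difference gradient and all function names are assumptions introduced here, and the moment-based version of the statistic is included purely as a cross-check.

```python
import math


def direction(x):
    """Return d(x) = (x - xbar 1_n) / r together with r = ||x - xbar 1_n||."""
    n = len(x)
    xbar = sum(x) / n
    r = math.sqrt(sum((xi - xbar) ** 2 for xi in x))
    return [(xi - xbar) / r for xi in x], r


def power_sum(d, p):
    """T_p(d) = sum_i d_i^p."""
    return sum(di ** p for di in d)


def jarque_bera(x):
    """T = n (n T_3^2 / 6 + (n T_4 - 3)^2 / 24), evaluated at d(x)."""
    n = len(x)
    d, _ = direction(x)
    t3, t4 = power_sum(d, 3), power_sum(d, 4)
    return n * (n * t3 ** 2 / 6 + (n * t4 - 3) ** 2 / 24)


def jarque_bera_moments(x):
    """Same statistic from sample skewness S and kurtosis K (cross-check)."""
    n = len(x)
    xbar = sum(x) / n
    m2, m3, m4 = (sum((xi - xbar) ** p for xi in x) / n for p in (2, 3, 4))
    s, k = m3 / m2 ** 1.5, m4 / m2 ** 2
    return n * (s * s / 6 + (k - 3) ** 2 / 24)


def inverse_squared_distortion(x):
    """Closed form for the squared gradient norm of T o d at x."""
    n = len(x)
    d, r = direction(x)
    t3, t4, t5, t6 = (power_sum(d, p) for p in (3, 4, 5, 6))
    c = n * t4 / 3 - 1
    bracket = (c * c * t6 + 2 * c * t3 * t5
               - (t3 ** 2 + n * t4 ** 2 / 3 - t4) ** 2
               + t3 ** 2 * t4 - n * t3 ** 2 * t4 ** 2 / 9)
    return (n ** 4 / r ** 2) * bracket


def gradient_norm_squared(x, h=1e-6):
    """Central finite-difference estimate of the squared gradient norm of T o d."""
    total = 0.0
    for j in range(len(x)):
        up, down = list(x), list(x)
        up[j] += h
        down[j] -= h
        total += ((jarque_bera(up) - jarque_bera(down)) / (2 * h)) ** 2
    return total


x = [0.3, -1.2, 2.5, 0.7, -0.4, 1.9, -2.2, 0.1]  # made-up data
analytic = inverse_squared_distortion(x)
numeric = gradient_norm_squared(x)
print(jarque_bera(x), jarque_bera_moments(x))
print(analytic, numeric)
```

Since T depends on x only through d(x), the statistic is unchanged when x is replaced by a + cx with c > 0, which the check below also exercises.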

In Figure 1, we have plotted the densities $f_T$ and the invariant P-values
based on the Jarque-Bera test statistic for several sample sizes n. The densities
are quite irregular for small sample sizes and skewed. Note that, while
the formula for the Jarque-Bera statistic is simple, it is, in fact, a degree
8 polynomial in the $d_i$. The volume distortion reflects this complexity and
this is seen to have an effect, even for very large sample sizes. We estimated
$f_T(t)$ and $f_T^*(t) = f_T(t)\,E(J_T^{-1}(X) \mid T = t)$ via simulation using kernel density
estimation methods.

Fig. 1. Densities and invariant P-values for Jarque-Bera test for various sample sizes
n when sampling from normal.

There is no reason to suppose, however, that the Jarque-Bera statistic
represents the best way to combine $T_3$ and $T_4$. It seems more natural to set
$T = (T_3, T_4)$ and then compute (6), as this takes account of the dependence
between $T_3$ and $T_4$. As illustrated in Figure 2, the P-values obtained via
(6) and (3) with this T are very similar, indicating that volume distortion is
playing a very small role, even for small sample sizes. Also, these P-values are
much more critical of the normality assumption than the Jarque-Bera test
and we note that these are based on the precise form of the joint distribution
of $T_3$ and $T_4$. Finally, as n increases, these P-values converge to the same
P-value as given by (1) using the Jarque-Bera statistic. Overall, this approach
to combining $T_3$ and $T_4$ seems to have considerable advantages over the
Jarque-Bera test.

A similar analysis can be carried out for other discrepancy statistics.
For example, the Shapiro-Wilk test has $J_T(x)$ constant for $x \in T^{-1}\{t\}$.
Correcting for volume distortion results in a small change in the usual
P-values for small n and this effect disappears as n grows. Overall, the
Shapiro-Wilk test is much better behaved than the Jarque-Bera test. Note that the
Shapiro-Wilk test statistic is a quadratic function of the $d_i$.
Fig. 2. Contour of $(T_3, T_4)$ giving the 0.05 point for tests based on (6), (3) and the
Jarque-Bera test, when n = 10. (Legend: Density, Jarque-Bera; axis label: $\sqrt{n/24}\,(nT_4 - 3)$.)

When U is an ancillary statistic, a function of U can be used to assess
the model. If we consider the transformation $T \circ U$, then we must evaluate
$\int_{(T \circ U)^{-1}\{t\}} f_\theta(x)\,\nu_t(dx)$ and, in general, there is no reason to suppose
that this is independent of $\theta$. If the distribution of $T \circ U$ is discrete, however,
then $\int_{(T \circ U)^{-1}\{t\}} f_\theta(x)\,\nu_t(dx)$ is the probability function of $T \circ U$ and,
as such, is independent of $\theta$. Theorem 6 will show that the P-value based
on $\int_{(T \circ U)^{-1}\{t\}} f_\theta(x)\,\nu_t(dx)$ is independent of $\theta$ for a very broad class of
ancillaries.

Consider the following example which will serve as an archetype for a 
common situation where ancillaries arise. 

Example 4 (Location-scale models). Suppose that we have $x \in R^n$ and
that the model is $x = \mu 1_n + \sigma z$, where z is distributed with density f with
respect to volume measure on $R^n$ and $\mu \in R^1$, $\sigma > 0$ are unknown. Then
x has density $f_{\mu,\sigma}(x) = \sigma^{-n} f((x - \mu 1_n)/\sigma)$. We take the parameter space
to be $\Theta = \{(\mu, \sigma) : \mu \in R^1, \sigma > 0\}$ and note that we have a group product
defined on $\Theta$ via $(\mu_1, \sigma_1)(\mu_2, \sigma_2) = (\mu_1 + \sigma_1\mu_2, \sigma_1\sigma_2)$. This group acts on
$R^n$ via $(\mu, \sigma)x = \mu 1_n + \sigma x$. A maximal invariant is then given by
$U(x) = (x - x^*_m 1_n)/(x^*_{q_3} - x^*_{q_1})$, where $x^*_m$ is the sample median, that is, the middle
sample value, when n is odd and the average of the two middle sample
values when n is even, and $x^*_{q_i}$ is the ith quartile. Since $U(x)$ is invariant, it
is ancillary. Note that $U^{-1}\{u\} = \{x : x = a1_n + cu$ for some $(a, c) \in \Theta\} = \Theta u$,
that is, $U^{-1}\{u\}$ is an orbit of the group action. Clearly, this orbit is half of
a two-dimensional plane in $R^n$ and so the volume measure $\nu_u$ is just area.
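
The invariance of U under the location-scale group can be verified directly. The sketch below makes assumptions beyond the text: the quartiles are computed with the "inclusive" interpolation rule of Python's `statistics.quantiles`, which is one of several common conventions and is not specified in the paper, and the data and function names are invented for illustration.

```python
import statistics


def maximal_invariant(x):
    """U(x) = (x - median 1_n) / (Q3 - Q1), using an assumed quartile convention."""
    med = statistics.median(x)
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    return [(xi - med) / (q3 - q1) for xi in x]


def act(mu, sigma, x):
    """Group action (mu, sigma) x = mu 1_n + sigma x, with sigma > 0."""
    return [mu + sigma * xi for xi in x]


x = [2.1, -0.3, 4.7, 1.2, 0.8, 3.3, -1.5]  # made-up data
u = maximal_invariant(x)
u_moved = maximal_invariant(act(5.0, 3.0, x))
print(u)
print(u_moved)
```

Any quartile rule that is location and scale equivariant would serve equally well here; the ancillary U always has median 0 and interquartile range 1.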
If we wish to base our checking on U itself, then we must evaluate

$f^*_{\mu,\sigma,U}(u) = \int_{U^{-1}\{u\}} f_{\mu,\sigma}(x)\,\nu_u(dx) = \int_0^\infty \int_{-\infty}^\infty f_{\mu,\sigma}(a1_n + cu)\sqrt{n}\,da\,dc$
$= \int_0^\infty \int_{-\infty}^\infty \sigma^{-n} f\left(\frac{a - \mu}{\sigma}1_n + \frac{c}{\sigma}u\right)\sqrt{n}\,da\,dc$
$= \sigma^{-(n-2)} \int_0^\infty \int_{-\infty}^\infty f(a1_n + cu)\sqrt{n}\,da\,dc.$

Accordingly, the P-value for model checking is given by

$P_U(f^*_{\mu,\sigma,U}(u) \leq f^*_{\mu,\sigma,U}(u_0))$
$= P_U\left( \int_0^\infty \int_{-\infty}^\infty f(a1_n + cu)\,da\,dc \leq \int_0^\infty \int_{-\infty}^\infty f(a1_n + cu_0)\,da\,dc \right)$

and, as this is independent of the model parameter, we have a valid P-value
for checking the model. If, instead, we use a function $T(U)$, then an
application of Lemma 4 shows that $f^*_{\mu,\sigma,T \circ U}$ is independent of $(\mu, \sigma)$, by the
same argument, as the Jacobian factors do not depend on the parameter.
Note that if f is the N(0, 1) density, then basing model checking on the
ancillary d or on the conditional distribution of the data given a minimal
sufficient statistic produces the same results.

More generally, suppose we have a group model $\{f_g : g \in G\}$, where G is a
group with a smooth product acting freely and smoothly on $\mathcal{X}$ and
$f_g(x) = f(g^{-1}x)J_g(g^{-1}x)$ for some fixed density f. Now, suppose that $[\cdot] : \mathcal{X} \to G$ is
smooth and satisfies $[gx] = g[x]$ so that $U(x) = [x]^{-1}x$ is a maximal invariant
and is thus ancillary. Hence, $u = U(x) \in \mathcal{X}$, $x = [x]U(x)$ and $U^{-1}\{u\}$ is the
orbit $\{gu : g \in G\}$. Now, if $\nu_G^*$ denotes volume measure on G, we have that
$\nu_u = K(u)\nu_G^*$ for some positive function K. Let $z = g^{-1}x$ so that $[z] = g^{-1}[x]$
and let $J_g^*([z])$ denote the Jacobian of the transformation $[z] \to [x]$. We
then have

$f^*_{g,U}(u) = \int_{U^{-1}\{u\}} f_g(x)\,\nu_u(dx) = \int_{\{gu : g \in G\}} f_g([x]u)\,\nu_u(dx)$.

Now, if we can write $J_g(u)J_g^*(z) = L(u)m(g)$ for some positive functions
L and m, then we have that the invariant P-value $P_U(f^*_{g,U}(u) \leq f^*_{g,U}(u_0))$
is indeed independent of g. Further, by Lemma 4, this will also hold for
$T \circ U$. In Example 4, $J_{(\mu,\sigma)}(u) = \sigma^{-n}$ and, with $[x] = (x^*_m, x^*_{q_3} - x^*_{q_1})$, we
have $J^*_{(\mu,\sigma)}([z]) = \sigma^2$ and this condition is satisfied. More generally, this condition
is satisfied in a wide range of group models, such as those discussed in
[7]. Accordingly, the following result is broadly applicable.
Theorem 6. Suppose that $\{f_g : g \in G\}$ is a family of densities with respect
to volume measure $\mu_X$ on $X$, where $G$ is a group with a smooth product
and a smooth action defined on $X$, and $f_g(x) = f(g^{-1}x)J_g(g^{-1}x)$. Further,
suppose that there exists a smooth $[\cdot] : X \to G$ satisfying $[gx] = g[x]$ and let
$J^*([z])$ denote the Jacobian of the transformation $[z] \to [x]$, where $z = g^{-1}x$. If

$$J_g(u)J^*([z]) = L(u)m(g)$$

for some positive functions $L$ and $m$,
then we have that the P-value (8) based on the ancillary $T \circ U$, with $U(x) =
[x]^{-1}x$ and $T$ smooth, is independent of the model parameter and is thus a
valid check on the model.
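
The construction of the maximal invariant $U(x) = [x]^{-1}x$ can be checked numerically. The sketch below assumes the location-scale group acting by $x \mapsto \mu + \sigma x$ and takes $[x]$ to be the sample mean and standard deviation. This choice of $[x]$ and all function names are assumptions made for illustration and need not match the choice used in Example 4.

```python
import numpy as np

def bracket(x):
    """Assumed equivariant map [x]: sample -> group element (location, scale)."""
    return float(np.mean(x)), float(np.std(x, ddof=1))

def act(g, x):
    """Action of g = (mu, sigma) on a sample: x -> mu + sigma * x."""
    mu, sigma = g
    return mu + sigma * np.asarray(x, dtype=float)

def compose(g, h):
    """Group product (mu1, s1)(mu2, s2) = (mu1 + s1 * mu2, s1 * s2)."""
    return g[0] + g[1] * h[0], g[1] * h[1]

def inverse(g):
    """Group inverse of g = (mu, sigma)."""
    mu, sigma = g
    return -mu / sigma, 1.0 / sigma

def maximal_invariant(x):
    """U(x) = [x]^{-1} x."""
    return act(inverse(bracket(x)), x)

rng = np.random.default_rng(0)
x = rng.normal(size=8)
g = (3.0, 2.5)          # hypothetical group element with sigma > 0
gx = act(g, x)

# Equivariance [gx] = g[x] and invariance U(gx) = U(x).
equivariant = np.allclose(bracket(gx), compose(g, bracket(x)))
invariant = np.allclose(maximal_invariant(gx), maximal_invariant(x))
print(equivariant, invariant)
```

Because $U(gx) = U(x)$ for every $g$, any statistic of the form $T \circ U$ has a distribution that does not involve the model parameter, which is what makes it usable as a check on the model.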

4. Conclusions. The use of P-values is a somewhat controversial topic.
When we are concerned with model checking, however, their use seems
unavoidable. At least one problem with P-values is the ambiguity as to how
they should be defined when we have data $x_0$, a single probability measure
$P$ from which the data were supposedly generated and a discrepancy statistic
$T$. While (1) has some appeal, the rationale for this is dependent on the form
of the distribution $P_T$, namely, the right tail being the only region where
surprising values of $T(x_0)$ may occur, and this clearly does not hold for an
arbitrary $T$. For the discrete case, (2) seems like a much more appealing
definition. A difficulty with (2) arises when we try to generalize this to
absolutely continuous contexts since the simple analog of (2) is not invariant
under 1-1, smooth transformations of $T$.
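
The lack of invariance can be seen in a small simulation. The sketch below assumes that the simple continuous analog of (2) is the probability of observing a density value no larger than the one at the observed point, and it uses a hypothetical setup: $X \sim \mathrm{Exp}(1)$, observed value 3, and the 1-1 smooth map $\log X$. None of these specifics come from the paper.

```python
import numpy as np

def naive_density_pvalue(density, sample, t0):
    """Monte Carlo estimate of P(f_T(T) <= f_T(t0)) from draws of T."""
    return float(np.mean(density(sample) <= density(t0)))

def f_x(t):
    """Density of X ~ Exp(1)."""
    return np.exp(-t)

def f_y(y):
    """Density of Y = log X when X ~ Exp(1)."""
    return np.exp(y - np.exp(y))

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=200_000)
x0 = 3.0  # hypothetical observed value

# The same data and the same observation, expressed on two scales.
p_x = naive_density_pvalue(f_x, x, x0)
p_y = naive_density_pvalue(f_y, np.log(x), np.log(x0))
print(p_x, p_y)
```

On the original scale the estimate is close to $e^{-3} \approx 0.05$, while on the log scale it is roughly 0.21, because the density of $\log X$ is unimodal and small values of $X$ now also count as surprising. The two scales carry the same information, yet the naive P-values disagree.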

We have argued that an appropriate, generalized definition of an invariant
P-value can be obtained from (2). For this, we must take into account that
a continuous model is essentially an approximation to a true discrete model
as we always measure responses to some finite accuracy. Further, we must
not allow volume distortions induced by a discrepancy statistic $T$ to have
any effect on our inferences. This leads us to the invariant P-value given
by (6). The definition is seen to be intuitively reasonable and to perform
well in a variety of examples. For other model checking problems, where our
approach to P-values seems applicable, see [1, 2, 8] and [11].
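
One way to read the finite-accuracy argument is sketched below. It is offered as an illustration only and is not the paper's construction (6). It assumes $X \sim \mathrm{Exp}(1)$ recorded on equal-width cells of width `delta`, a hypothetical observation, and the 1-1 relabelling $T = \log X$. The discrete form (2) is then applied to the cells themselves.

```python
import numpy as np

def discrete_pvalue(cell_probs, observed_cell):
    """Form (2): total probability of cells no more probable than the observed one."""
    p0 = cell_probs[observed_cell]
    return float(cell_probs[cell_probs <= p0].sum())

# Assumed setup: X ~ Exp(1), measured to accuracy delta.
delta, n_cells = 0.01, 4000
edges = delta * np.arange(n_cells + 1)   # cell boundaries on the X scale
x0 = 3.005                               # hypothetical observation
observed_cell = int(x0 // delta)         # index of the cell containing x0

# Cell probabilities from the survival function exp(-x) of X.
probs_x = np.exp(-edges[:-1]) - np.exp(-edges[1:])

# The same cells relabelled by T = log X. The survival function of T is
# S(t) = exp(-exp(t)); log(0) = -inf gives S = 1 at the left end.
with np.errstate(divide="ignore"):
    t_edges = np.log(edges)
probs_t = np.exp(-np.exp(t_edges[:-1])) - np.exp(-np.exp(t_edges[1:]))

p_x = discrete_pvalue(probs_x, observed_cell)
p_t = discrete_pvalue(probs_t, observed_cell)
print(p_x, p_t)
```

Since a 1-1 map only relabels the measurement cells, the cell probabilities and hence the discrete P-value are unchanged. The disagreement between scales appears only when densities, which absorb the volume distortion of the map, are compared directly.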

Acknowledgments. We wish to thank the reviewers for a number of sug- 

REFERENCES

[1] Bayarri, M. J. and Berger, J. O. (2000). P-values for composite null models.
    J. Amer. Statist. Assoc. 95 1127-1142, 1157-1170. With comments and a
    rejoinder by the authors. MR1804239

[2] Bayarri, M. J. and Castellanos, M. E. (2007). Bayesian checking of the second
    levels of hierarchical models. Statist. Sci. 22 322-343. With comments and a
    rejoinder by the authors. MR2416808

[3] Berger, J. O. and Delampady, M. (1987). Testing precise hypotheses. Statist.
    Sci. 2 317-352. With comments and a rejoinder by the authors. MR0920141

[4] Berger, J. O. and Sellke, T. (1987). Testing a point null hypothesis:
    Irreconcilability of P-values and evidence. J. Amer. Statist. Assoc. 82
    112-139. With comments and a rejoinder by the authors. MR0883340

[5] Bernardo, J. M. and Smith, A. F. M. (1994). Bayesian Theory. Wiley, New York.
    MR1274699

[6] Evans, M. and Jang, G.-H. (2008). Invariant P-values for model checking and
    checking for prior-data conflict. Technical Report 0803, Dept. Statistics,
    Univ. Toronto.

[7] Fraser, D. A. S. (1979). Inference and Linear Models. McGraw-Hill, New York.
    MR0535612

[8] Gelman, A., Meng, X.-L. and Stern, H. (1996). Posterior predictive assessment
    of model fitness via realized discrepancies. Statist. Sinica 6 733-807. With
    comments and a rejoinder by the authors. MR1422404

[9] Hall, P. and Selinger, B. (1986). Statistical significance: Balancing evidence
    against doubt. Austral. J. Statist. 28 354-370.

[10] Loomis, L. H. and Sternberg, S. (1968). Advanced Calculus. Addison-Wesley,
    Reading, MA. MR0227327

[11] Meng, X.-L. (1994). Posterior predictive p-values. Ann. Statist. 22 1142-1160.
    MR1311969

[12] Morrison, D. E. and Henkel, R. E. (1970). The Significance Test Controversy —
    A Reader. Aldine Publishing, Chicago.

[13] Royall, R. M. (1997). Statistical Evidence: A Likelihood Paradigm. Monographs
    on Statistics and Applied Probability 71. Chapman and Hall, London. MR1629481

[14] Rudin, W. (1974). Real and Complex Analysis, 2nd ed. McGraw-Hill, New York.
    MR0344043

[15] Schervish, M. J. (1996). P-values: What they are and what they are not. Amer.
    Statist. 50 203-206. MR1422069

[16] Tjur, T. (1974). Conditional Probability Distributions. Lecture Notes 2. Univ.
    Copenhagen, Copenhagen. MR0345151

Department of Statistics 
University of Toronto 
Toronto, ON M5S 3G3 
Canada 

E-mail: mevans@utstat.utoronto.ca
        gunho@utstat.utoronto.ca



